07 - Diamond Challenge
By Dickson Kwong
Classwork
#install.packages("splitstackshape")
#install.packages("dplyr")
#install.packages("stringr")
#install.packages("ggplot2")
#install.packages("DAAG")
#install.packages("caret")
library("splitstackshape")
## Loading required package: data.table
library("plyr")
library("dplyr")
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following objects are masked from 'package:data.table':
##
## between, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library("stringr")
library("ggplot2")
library("caret")
## Loading required package: lattice
# Multiple plot function
#
# ggplot objects can be passed in ..., or to plotlist (as a list of ggplot objects)
# - cols: Number of columns in layout
# - layout: A matrix specifying the layout. If present, 'cols' is ignored.
#
# If the layout is something like matrix(c(1,2,3,3), nrow=2, byrow=TRUE),
# then plot 1 will go in the upper left, 2 will go in the upper right, and
# 3 will go all the way across the bottom.
#
multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
require(grid)
# Make a list from the ... arguments and plotlist
plots <- c(list(...), plotlist)
numPlots = length(plots)
# If layout is NULL, then use 'cols' to determine layout
if (is.null(layout)) {
# Make the panel
# ncol: Number of columns of plots
# nrow: Number of rows needed, calculated from # of cols
layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
ncol = cols, nrow = ceiling(numPlots/cols))
}
if (numPlots==1) {
print(plots[[1]])
} else {
# Set up the page
grid.newpage()
pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))
# Make each plot, in the correct location
for (i in 1:numPlots) {
# Get the i,j matrix positions of the regions that contain this subplot
matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))
print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
layout.pos.col = matchidx$col))
}
}
}
Today challenge demonstrate various uses for scatterplots and outline some strategies to help make sure key patterns are not obscured by the scale or qualitative group-level differences in the data (e.g., the relationship between diamond price and carat differs for clarity). The motivation in this challenge is to come up with a model of diamond prices that you can use to help make sure you can get a better deal when shopping for diamond as well as refresh your linear regression technique, specified based on insight from exploratory data analysis (EDA) combined with (somewhat) informed speculation. The diamond dataset was scrapped from Diamindse.Info by using some Python script in March 2015.
Diamond Carat Weight: a carat is a metric unit used to measure the weight of gemstones and pearls. One carat is equal to 200mg. Diamond weight below one carat can be expressed in points, each of which weighs 1/100th of a carat, or 2mg. For example, a half carat diamond could be expressed as “50 points”. “25 points” refers to 1/4 carat.
While white or colourless stones are traditionally used in diamond engagement rings, diamonds are found in a wide spectrum of colours. Engagement ring diamonds are graded from D to Z, with D being the prized colourless diamond. The hue of D, E and F coloured diamonds is difficult to differentiate to the untrained eye, but the amount of colour becomes more apparent as the alphabet progresses. Z graded diamonds are a pale yellow or brown colour, and anything falling outside of this range is considered a fancy coloured diamond. The colour of the metal in a mounting can either mask or enhance the diamond colour. Yellow gold makes slightly yellow or brown diamonds appear more colourless. If a diamond is mounted in white gold or platinum, the colour becomes more apparent
ost diamonds contain internal “birthmarks”, called inclusions, which make each diamond individual. There are 11 grades of diamonds, ranging from the rarest Flawless (FL) which contain no inclusions when viewed under 10x magnification, to I3, where the diamond contains inclusions easily apparent to the naked eye. Master cutters take into account these inclusions when shaping the diamond, and they don’t endanger the beauty or durability of the diamond. Diamonds graded SI2 and above are generally considered “eye clean” with these marks only visible under magnification.
Rather than describing the shape of a stone, cut refers to the proportion, symmetry brilliance and polish of a diamond. This means that the diamond cutter has taken into account the inherent characteristics of each individual stone and chose the best way to shape it to accentuate its beauty.
It is very essential to know Diamond Anatomy before understanding Cut.
Set the working directory:
setwd("/Users/DicksonK/Desktop/r_class/diamond/")
Here is where the data files at, and I import the file for you.
## I personally prefer to import with stringsASFactors as False
diamond_df = read.csv("/Users/DicksonK/Desktop/r_class/diamond/diamond_train_2015.csv", stringsAsFactors = F)
Let’s have a look at the summary:
summary(diamond_df)
## carat cut color clarity
## Min. :0.200 Length:751598 Length:751598 Length:751598
## 1st Qu.:0.350 Class :character Class :character Class :character
## Median :0.520 Mode :character Mode :character Mode :character
## Mean :0.779
## 3rd Qu.:1.010
## Max. :9.500
## table depth cert measurements
## Min. : 0.00 Min. : 0.0 Length:751598 Length:751598
## 1st Qu.:56.00 1st Qu.:60.9 Class :character Class :character
## Median :58.00 Median :62.1 Mode :character Mode :character
## Mean :56.98 Mean :60.5
## 3rd Qu.:59.00 3rd Qu.:62.7
## Max. :85.00 Max. :81.6
## price
## Min. : 136
## 1st Qu.: 810
## Median : 1670
## Mean : 4867
## 3rd Qu.: 5140
## Max. :99978
Let’s have a look at the first few rows:
head(diamond_df)
## carat cut color clarity table depth cert measurements price
## 1 1.01 V.Good J VS1 55 64.0 GIA 6.35 x 6.32 x 4.05 4940
## 2 1.24 Ideal F VS1 59 61.6 GIA 6.87 x 6.91 x 4.25 9582
## 3 0.30 V.Good F VS2 59 63.9 GIA 4.21 x 4.16 x 2.67 750
## 4 0.60 V.Good E VS2 57 62.9 GIA 5.35 x 5.33 x 3.36 2359
## 5 0.31 Ideal D VS2 58 61.2 GIA 4.37 x 4.34 x 2.67 940
## 6 0.60 Good I SI1 62 60.2 GIA 5.39 x 5.45 x 3.27 1570
## Set the working directory:
setwd("/Users/DicksonK/Desktop/r_class/diamond/")
## setwd("W://Dept//COE-Ent//04 Projects & Systems//Data_Science//WWTBDS//Pre_section_2//data//")
## Load the dataset
diamond_df = read.csv("diamond_train_2015_clean.csv", stringsAsFactors = F)
## Check number of rows
nrow(diamond_df)
## [1] 741349
## Just take a look
head(diamond_df)
## carat cut color clarity table depth cert measurements_1
## 1 1.01 V.Good J VS1 55 64.0 GIA 6.35
## 2 1.24 Ideal F VS1 59 61.6 GIA 6.87
## 3 0.30 V.Good F VS2 59 63.9 GIA 4.21
## 4 0.60 V.Good E VS2 57 62.9 GIA 5.35
## 5 0.31 Ideal D VS2 58 61.2 GIA 4.37
## 6 0.60 Good I SI1 62 60.2 GIA 5.39
## measurements_2 measurements_3 price
## 1 6.32 4.05 4940
## 2 6.91 4.25 9582
## 3 4.16 2.67 750
## 4 5.33 3.36 2359
## 5 4.34 2.67 940
## 6 5.45 3.27 1570
## Check the summary
summary(diamond_df)
## carat cut color clarity
## Min. :0.2000 Length:741349 Length:741349 Length:741349
## 1st Qu.:0.3500 Class :character Class :character Class :character
## Median :0.5200 Mode :character Mode :character Mode :character
## Mean :0.7776
## 3rd Qu.:1.0100
## Max. :9.5000
## table depth cert measurements_1
## Min. : 0.50 Min. : 0.50 Length:741349 Min. : 0.220
## 1st Qu.:56.00 1st Qu.:61.00 Class :character 1st Qu.: 4.530
## Median :58.00 Median :62.10 Mode :character Median : 5.170
## Mean :57.36 Mean :61.28 Mean : 5.608
## 3rd Qu.:59.00 3rd Qu.:62.70 3rd Qu.: 6.410
## Max. :85.00 Max. :81.60 Max. :7032.000
## measurements_2 measurements_3 price
## Min. : 1.000 Min. : 0.550 Min. : 136
## 1st Qu.: 4.540 1st Qu.: 2.800 1st Qu.: 810
## Median : 5.180 Median : 3.200 Median : 1660
## Mean : 5.601 Mean : 3.473 Mean : 4851
## 3rd Qu.: 6.420 3rd Qu.: 3.990 3rd Qu.: 5122
## Max. :738.000 Max. :483.000 Max. :99978
nrow(diamond_df[diamond_df$carat > 0.75 & diamond_df$price <= 1000, ])
## [1] 56
range(diamond_df[diamond_df$cut == "Ideal" & diamond_df$color %in% c("D", "E", "F") & diamond_df$clarity %in% c("FL", "LF", "VVS1", "VVS2"), "price"])
## [1] 309 99880
sd(diamond_df[diamond_df$carat == 1.0 & diamond_df$color %in% c("D", "E", "F") & diamond_df$cert == "GIA", "price"])
## [1] 2165.788
median(diamond_df[diamond_df$carat == 0.51, "price"])
## [1] 1624
freq_carat = as.data.frame(table(diamond_df$carat))
freq_carat = freq_carat[order(-freq_carat$Freq), ]
top10_carat = freq_carat[1:10, ]
top10_carat
## Var1 Freq
## 11 0.3 90692
## 21 0.4 54761
## 31 0.5 43896
## 82 1.01 30435
## 12 0.31 27808
## 81 1 23386
## 51 0.7 19996
## 13 0.32 17800
## 32 0.51 14831
## 71 0.9 14815
## Save the correlation as a dataframe
price_cor = as.data.frame.matrix(cor(diamond_df[sapply(diamond_df, is.numeric)]))["price"]
price_cor["feature"] = row.names(price_cor)
rownames(price_cor) = NULL
price_cor = price_cor[c(2,1)]
price_cor
## feature price
## 1 carat 0.86607919
## 2 table 0.04421053
## 3 depth 0.03797440
## 4 measurements_1 0.11267028
## 5 measurements_2 0.32053443
## 6 measurements_3 0.29598735
## 7 price 1.00000000
## Remove price itself and just sort it so it looks nicer
price_cor = price_cor[price_cor$feature != "price", ]
price_cor = price_cor[order(-price_cor$price), ]
price_cor
## feature price
## 1 carat 0.86607919
## 5 measurements_2 0.32053443
## 6 measurements_3 0.29598735
## 4 measurements_1 0.11267028
## 2 table 0.04421053
## 3 depth 0.03797440
## Filter those over 0.4
price_cor = price_cor[price_cor$price > 0.4, ]
price_cor
## feature price
## 1 carat 0.8660792
## Base plot
ggplot(diamond_df, aes(x=carat, y=price)) +
geom_point(shape=1, alpha = 0.3) +
ggtitle('Price by Carat')
col_list = c("cut", "color", "clarity", "cert")
plot_list = NULL
for (col_name in col_list) {
plot_list[[which(col_list==col_name)]] = ggplot(diamond_df, aes_string(x="carat", y="price", color=col_name)) +
geom_point(alpha = 0.1) +
guides(colour = guide_legend(override.aes = list(alpha = 1))) +
ggtitle(paste("Price by Carat", col_name, sep=" - "))
}
multiplot(plot_list[[1]], plot_list[[2]], plot_list[[3]], plot_list[[4]], cols=2)
## Loading required package: grid
## Color & clarity
rm(plot_list)
## geom_histogram
ggplot(diamond_df, aes(x=price)) + geom_histogram(colour = "blue", fill = "yellow", binwidth=500) +
ggtitle('Diamond Price Distributions')
## geom_density
ggplot(diamond_df, aes(x=price)) + geom_density() + ggtitle('Diamond Price Distributions')
## The following loop will use each feature in the dataset and try to use linear regression to predict the diamond price. R-square will be saved to a list
rs_list = NULL
for (col_name in names(diamond_df[1:(ncol(diamond_df)-1)])){
rs_list[which(names(diamond_df[1:(ncol(diamond_df)-1)]) == col_name)] = summary(lm(as.formula(paste("price ~ ", col_name, sep = "")), data = diamond_df))$r.squared
}
## Put the features and R-square into a dataframe
rs_df = data.frame(feature = names(diamond_df[1:(ncol(diamond_df)-1)]), r_square = rs_list)
rs_df
## feature r_square
## 1 carat 0.750093167
## 2 cut 0.004269707
## 3 color 0.009666985
## 4 clarity 0.015017332
## 5 table 0.001954571
## 6 depth 0.001442055
## 7 cert 0.013598743
## 8 measurements_1 0.012694592
## 9 measurements_2 0.102742322
## 10 measurements_3 0.087608512
## Just sort it so it looks nicer
rs_df = rs_df[order(-rs_df$r_square), ]
rs_df
## feature r_square
## 1 carat 0.750093167
## 9 measurements_2 0.102742322
## 10 measurements_3 0.087608512
## 4 clarity 0.015017332
## 7 cert 0.013598743
## 8 measurements_1 0.012694592
## 3 color 0.009666985
## 2 cut 0.004269707
## 5 table 0.001954571
## 6 depth 0.001442055
## Feature with highest R
rs_df[which(rs_df$r_square == max(rs_df$r_square)), ]
## feature r_square
## 1 carat 0.7500932
## Create a linear regression model by using carat as a feature
lr_carat = lm(price ~ carat, data = diamond_df)
## Diagnostics plot
ggplot(data.frame(x=predict(lr_carat),y=residuals(lr_carat)), aes(x=predict(lr_carat),y=residuals(lr_carat))) +
geom_point() + xlab("Fitted values") + ylab("Residuals") + geom_hline(yintercept=0) + geom_smooth() +
ggtitle("Residuals vs Fitted")
ggplot(lr_carat$model, aes_string(x = names(lr_carat$model)[2], y = names(lr_carat$model)[1])) +
geom_point() +
stat_smooth(method = "lm", col = "red") +
labs(title = paste("Adj R2 = ",signif(summary(lr_carat)$adj.r.squared, 5),
"Intercept =",signif(lr_carat$coef[[1]],5 ),
" Slope =",signif(lr_carat$coef[[2]], 5),
" P =",signif(summary(lr_carat)$coef[2,4], 5)))
| id | price_est |
|---|---|
| 1 | 804 |
| 2 | 204 |
| 3 | 2948 |
| … | … |
One way to analyze data which does not follow a normal distribution is to try to transform the data points into a distribution which is approximately normal. In other words, take the original data X and apply some mathematical operation to each data point to make a set of data X’
There are many different possible transformations – in fact any mathematical operation which results in a monotonic (i.e. continuously increasing or decreasing) and one-to-one relationship with the original data can result in a valid data transformation. The following will focus on a few which are often helpful.
The most often used data transformation is the logarithmic transformation. Here each data point is converted into its natural log.
X’ = ln[X]
This transformation is useful when the data distribution is skewed right:
It is also useful when the means and variances of different samples seem to be correlated (i.e., that the samples with higher means tend to have higher variances.)
A log scale is additive for cases in which the untransformed scale is multiplicative. therefore many morphological characters, subject to multiplicative growth curves, tend to show log-normal distributions. The log transform is therefore very important in biology.
Data which involve counts are often more normally distributed when a square-root transformation is applied.
When the data points are each themselves proportions, the distribution of these is often skewed. An arcsine transformation often makes these distributions more normal.